Panel: The State of the Art in Thai Language Processing
Abstract
This paper reviews the current state of technology and research progress in Thai language processing. It summarizes the characteristics of the Thai language and the approaches taken to overcome the difficulties in each processing task.

1 Some Problematic Issues in Thai Processing

The most fundamental semantic unit in a language is the word. In languages that mark word boundaries, words are explicitly identified. Thai has no word boundaries: Thai words are recognized implicitly, and in many cases their identification depends on individual judgement. This causes many difficulties in Thai language processing. To illustrate the problem, we use a classic English example, the segmentation of “GODISNOWHERE”.

No.  Segmentation       Meaning
(1)  God is now here.   God is here.
(2)  God is no where.   God doesn’t exist.
(3)  God is nowhere.    God doesn’t exist.

Segmentations (1) and (2) have opposite meanings. Segmentations (2) and (3) are ambiguous as to whether “nowhere” is one word or two. The difficulty is greatly aggravated when unknown words are present.

Thai is a tonal language: a phoneme with a different tone has a different meaning. Many specialized approaches have been introduced for both tone generation in speech synthesis research and tone recognition in speech recognition research.

These difficulties propagate to many levels of language processing, such as lexical acquisition, information retrieval, machine translation, and speech processing. Furthermore, a similar problem also occurs at the sentence and paragraph levels.

2 Word and Sentence Segmentation

The first and most obvious problem to attack is word identification and segmentation. For the most part, Thai language processing relies on manually created dictionaries, which are inconsistent in how they define word units and limited in size. [1] proposed a word extraction algorithm employing C4.5 with string features such as entropy and mutual information.
They reported 85% precision and 50% recall. For word segmentation, longest matching, maximal matching, and probabilistic segmentation were applied in early research [2], [3]. However, these approaches are limited in dealing with unknown words. More advanced word segmentation techniques capture language features such as context words, parts of speech, collocations, and semantics [4], [5]. These report accuracy of about 95-99%. For sentence segmentation, a trigram model was adopted and yielded 85% accuracy [6].

3 Machine Translation

Currently, only one machine translation system is available to the public: ParSit (http://www.links.nectec.or.th/services/parsit), a service for English-to-Thai web page translation. ParSit is a collaborative work of NECTEC, Thailand, and NEC, Japan. The system is based on an interlingual approach to MT, and its translation accuracy is about 80%. Other approaches, such as generate-and-repair [7] and sentence pattern mapping, have also been studied [8].

4 Language Resources

The only Thai text corpus available for research use is the ORCHID corpus. ORCHID is a 9-MB Thai part-of-speech tagged corpus initiated by NECTEC, Thailand, and the Communications Research Laboratory, Japan. ORCHID is available at http://www.links.nectec.or.th/orchid.

5 Research in Thai OCR

About 80 Thai characters are in frequent use, including consonants, vowels, tone marks, special marks, and numerals. Thai is written on 4 levels, without spaces between words, and the similarity among many character patterns has made the research challenging. Moreover, the mixing of English and Thai in ordinary Thai text creates many more patterns that OCR must recognize. For more than 10 years, there has been considerable growth in Thai OCR research, especially for the “printed character” task.
The early approaches focused on structural matching, and later work tended towards neural-network-based algorithms whose inputs encode special characteristics of Thai characters, e.g., curves, heads of characters, and placements. At least 3 commercial products have been launched, including “ArnThai” by NECTEC, which claims 95% recognition performance on clean input. Recent technical improvements to ArnThai are reported in [9]. Recently, the focus has shifted to developing systems that are more robust to unclean scanned input. This step requires more effective features, fuzzy algorithms, and document analysis. At the same time, the “offline Thai handwritten character recognition” task has been investigated, but it remains in the research phase, limited to isolated characters. Almost all proposed engines are neural-network-based, with several styles of input features [10], [11]. There has been a small amount of research on “online handwritten character recognition”. One approach was proposed by [12], which is also neural-network-based, with chain code input.

6 Thai Speech Technology

Thai, like Chinese, is a tonal language, and tonal perception is important to the meaning of speech. Current research in speech technology can be divided into 3 major fields: (1) speech analysis, (2) speech recognition, and (3) speech synthesis. Most of the research in (1) is done by linguists and concerns the basic study of Thai phonetics, e.g., [13]. In speech recognition, most current research [14] focuses on the recognition of isolated words. Developing continuous speech recognition requires a large-scale speech corpus. Practical research on continuous speech recognition is at an initial stage, with at least one published paper [15]. In contrast to Western speech recognition, topics specific to tonal languages, such as tone recognition, have been researched in depth, as seen in many papers, e.g., [16].
For text-to-speech synthesis, processing the idiosyncrasies of Thai text and handling the interplay of tones with intonation are the topics that make TTS algorithms for Thai different from those for other languages. The first successful system was built by [14], followed by one from NECTEC [15]. Both systems use the same synthesis technique, based on concatenating units from a demisyllable inventory.
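
To make the segmentation problem from Sections 1 and 2 concrete, the following is a minimal Python sketch, not the implementation used in any of the cited systems. It enumerates every dictionary-consistent segmentation of the “GODISNOWHERE” example and then applies greedy longest matching. The tiny lexicon and all function names are illustrative assumptions. Maximal matching is read here as choosing the candidate with the fewest words, which is also an assumption about the cited work.

```python
from typing import List, Set

# Hypothetical toy lexicon for the English example; a real Thai system
# would load a manually created dictionary instead.
LEXICON: Set[str] = {"god", "is", "no", "now", "here", "where", "nowhere"}


def all_segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in all_segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


def longest_matching(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy left-to-right segmentation that always takes the longest word.

    A character that starts no lexicon word is emitted as a one-character
    token, a crude stand-in for unknown-word handling.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: single unknown character
        for end in range(len(text), i, -1):
            if text[i:end] in lexicon:
                match = text[i:end]
                break
        tokens.append(match)
        i += len(match)
    return tokens


def fewest_words(text: str, lexicon: Set[str]) -> List[str]:
    """Pick the candidate with the fewest words (assumed reading of
    maximal matching); returns [] when no full segmentation exists."""
    candidates = all_segmentations(text, lexicon)
    return min(candidates, key=len) if candidates else []


if __name__ == "__main__":
    for seg in all_segmentations("godisnowhere", LEXICON):
        print(" ".join(seg))
    print(longest_matching("godisnowhere", LEXICON))
```

With this lexicon the enumeration yields the three readings from the table in Section 1, while longest matching commits to a single one. An input containing a character outside the lexicon has no full segmentation at all, which illustrates why purely dictionary-based methods struggle with unknown words.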